DiTASiC (Differential Taxa Abundance including Similarity Correction) is designed as a comprehensive approach for abundance estimation and differential abundance assessment of individual taxa in metagenomics samples. It provides resolution on the strain level, which is characterized by the presence of highly similar genome sequences. The challenge arises that many reads match to multiple genomes equally, causing a significant bias of abundance counts.
A generalized linear model is introduced for the resolution of the shared read counts and allows inference of accurate taxa abundances. In addition, abundance estimation uncertainties are captured, which play a crucial role in differential abundance analysis. A novel statistical framework is built, which integrates the abundance variance and infers abundance distributions for differential testing sensitive to strain level. No prior assumptions on overall abundance change are required. A resulting list of tested taxa is reported with estimated abundances, fold-changes and p-values to infer significance.
DiTASiC requires Python 3 and R installation (version ≥ 3.3.1).
- ‘NumPy’ and ‘Biopython’ need to be installed for Python.
- R packages optparse, RcppCNPy, and AER are required for R.
Internally, the read simulator Mason and the pseudoaligner kallisto are used and need to be available in the PATH or DiTASiC’s runtime directory. Binaries for both programs for x86-64 systems are provided in the .tar-package for easy installation.
The following input files are required to run the pipeline:
- Metagenomic read samples: fastq or fasta files
- Reference sequences: fasta files
- A Refs_path file, containing the absolute paths of all considered taxa Reference fasta files in this analysis: txt file
- A kallisto index file of the reference sequences is required. If not available yet, it will be automatically built during the ‘similarity matrix’ creation step of DiTASiC.
Alternatively, it can be manually generated by the user by executing the following command (see also kallisto Manual) :
kallisto index -i index_name Reference-FASTA-files
DiTASiC comprises four main commands:
Creating the similarity matrix of the reference sequences.
The key element is to imitate read and mapping characteristics as good as possible to reflect the source of ambiguities in the matrix. At minimum, a parameter determining the read length is obligatory, corresponding to the average read length found in the samples considered. Further parameters defining mismatch probabilities are optional, but are recommended (see Manual). Additionally, a file REFS_PATH holding the paths to the fasta-files of the references is required.
ditasic_matrix -l 100 REFS_PATH
similarity_matrix.npy, a square N x N - matrix for the N references listed in the given path-file REFS_PATH.
ditasic_matrixalso generates a kallisto index of all references, if no index is given via the -i flag.
Mapping the reads to the reference genome sequences.
In the mapping step, all reads of a sample are preliminarily assigned to the given reference sequences. For the mapping, internally, a pseudo-aligner as part of the kallisto framework (command kallisto pseudo, (see ‘Formats’ in Manual)) is used. Subsequently, the number of read hits per reference genome (both unique and shared hits) are counted and stored in a vector of length N according to the number of references considered.
ditasic_mapping -i INDEX -l 100 REFS_PATH SAMPLE
Default output: a numpy vector
SAMPLE_mapped_counts.npyand a file
SAMPLE_total.npywith the total number of reads mapped. This mapping step needs to be run individually for each sample.
Taxa abundance estimation in one sample.
Apply this shorter command for estimation of taxa abundances in one sample. If you are considering more samples and their comparison go to step (4).
The mapping counts obtained in previous step (2) are strongly biased in the presence of similar genome sequences. An abundance resolution down to strain/sub-strain level is provided by a GLM resolving the shared read counts. The files generated by the previous commands (1-2) serve as input: similarity matrix, mapping counts and total number of mapped reads. Optional arguments concerning a filtering step to identify absent taxa can be found in the Manual.
ditasic -r REFS_PATH -a similarity_matrix.npy -x SAMPLE1_mapped_counts.npy -n SAMPLE1_total.npy -o OUTPUT_filename [options]
Default output: a tab-separated file, listing the taxa with abundance count estimates along with corresponding error estimates, reporting the range of accuracy of the abundance estimates, and absent taxa flagged.
Taxa abundance estimation and differential assessment between two samples.
Here, abundances of taxa are estimated for two metagenomic samples and are additionally compared. The GLM model is applied for abundance resolution. Subsequently, a statistical framework identifies which taxa significantly change their abundance from one metagenome to another and which hold a constant abundance.
The same input data is used as in (3), but is extended by mapping counts and total number of reads of a second sample.
ditasic -r REFS_PATH -a similarity_matrix.npy -x SAMPLE1_mapped_counts.npy -n SAMPLE1_total.npy -y SAMPLE2_mapped_counts.npy -m SAMPLE2_total.npy -o OUTPUT_filename [options]
Default output: a tab-separated file, listing all taxa considered with estimated abundances for both samples, log2 fold-changes for differential evaluation and adjusted p-values to infer significance of differential abundance.
Further details about all available command-line options can be found in the Manual.
You can download a directory with sample data and a script executing a typical DiTASiC workflow here. It contains 35 bacterial reference sequences and two samples of reads that were generated from the references. The example script estimates the abundance of the first sample and the differential abundance between the first and the second sample, while taking care of all preliminary steps such as similarity matrix creation or mapping of the samples. The script
example.sh requires the
ditasic_mapping.py scripts to be available in your PATH.
The sample script demonstrates the three essential steps of DiTASiC and allows inspecting all temporary files that are created in a successful run of the program. The final program output can be compared to the expected output and should be identical.